library(foreign)
library(car)
## Loading required package: carData
library(nnet)
library(ggplot2)
library(reshape2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
## 
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
## 
##     precision, recall, sensitivity, specificity
library(ggplot2)

1. Introduction

In recent years, California has seen record-breaking numbers of wildfires. These fires harm wildlife and residents and carry serious environmental implications, and they are shaped by heightened CO2 levels, temperatures, and shifting precipitation. Previous exploratory analysis has shown that lower soil moisture and rainfall led to larger fires between 1992 and 2015, and that the same dry conditions produced more wildfires caused by equipment use. Higher rainfall, in turn, was associated with a greater chance that a wildfire was caused by lightning. Government spending appeared to have only a marginal impact on suppressing wildfires in California.

With climate change being an increasingly relevant issue these days, this research studies how wildfires in California have changed over time. Specifically, we focused on wildfires from 1992 to 2013 and sought to investigate factors that may have contributed to a significant impact on the intensity or volume of wildfires in California over the years.

The first dataset we found detailed 24 years’ worth of nationwide wildfires from 1992 to 2015 along with their associated sizes, causes, time of occurrence, and other metadata. We felt that analyzing every state’s wildfire patterns would be an unfocused and unfruitful effort, so we decided to focus on California, as it deals with the most wildfires compared to other states in the United States.

However, we needed more data to conduct a more in-depth analysis, so we retrieved data from Cal-Adapt, which is an effort developed by the Geospatial Innovation Facility at the University of California, Berkeley focused on providing data that portrays climate change in California. From here, we were able to collect data about California’s daily air temperature, soil moisture, and rainfall. From there, we were able to combine it with our wildfire data by aggregating monthly averages. We also added California’s yearly budget to our combined dataset because we were curious as to how it played a factor in the volume and intensity of wildfires over the years.

2. Basic EDA of wildfires in California

The initial exploratory data analysis made it clear that most fires in California coincided with high average air temperatures, low soil moisture, and periods of low rainfall. To identify factors that may drive future wildfires, the following questions need to be addressed: (1) how do temperature, soil moisture, and rainfall relate to fire size; (2) how frequent is each fire size class; and (3) does California's fire suppression budget track the number of wildfires?

To answer the first question, we used a correlation matrix to study the impact of temperature, soil moisture, and rainfall on fire size. The matrix showed a weak positive correlation between average air temperature and fire size, and strong negative correlations of fire size with both soil moisture and rainfall.
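A minimal sketch of that correlation check, using a toy data frame with the column names that appear in the merged data later in this report (the values here are illustrative, not the real data):

```r
# Correlation of fire size with the three environmental variables.
# Toy data frame standing in for the merged wildfire/climate data.
df <- data.frame(
  FIRE_SIZE                 = c(0.5, 2, 150, 800, 10),
  tair_day_livneh_vic       = c(12, 18, 27, 30, 21),  # air temperature (C)
  soilmoist1_day_livneh_vic = c(25, 20, 12, 9, 17),   # soil moisture
  rainfall_day_livneh_vic   = c(4, 2, 0.1, 0, 1)      # rainfall (mm)
)
round(cor(df), 2)  # pairwise Pearson correlation matrix
```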

We studied the frequency of the different fire size classes to answer the second question. Class A and B fires occur at high frequency in California; these are not considered a threat to the environment or nearby properties, since they typically burn out quickly. Fires of class C and above, by contrast, can be dangerous because the burned area grows quickly.
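The class-frequency tally can be sketched as follows, with a toy vector standing in for the real `FIRE_SIZE_CLASS` column:

```r
# Frequency of each fire size class (A = smallest, G = largest)
fire_class <- factor(c("A", "A", "B", "B", "B", "C", "D", "A", "B", "C"),
                     levels = LETTERS[1:7])
table(fire_class)              # raw counts per class
prop.table(table(fire_class))  # relative frequencies
```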

To answer the third question, we looked at plots of California’s fire suppression budget and wildfire count over the years to see if there was anything noteworthy. Below, we can see that there is no strong relationship between the money spent controlling wildfires and the number of wildfires that occur each year.
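A hedged sketch of that budget-versus-count comparison; the data frame and its column names (`budget`, `n_fires`) are hypothetical stand-ins, since the budget file itself is not shown in this section:

```r
library(ggplot2)
# Hypothetical yearly summary; the real budget column name may differ.
budget_df <- data.frame(
  Year    = 1992:1996,
  budget  = c(100, 120, 90, 150, 130),      # suppression budget ($M)
  n_fires = c(8000, 8500, 7800, 9000, 8700) # wildfires per year
)
ggplot(budget_df, aes(budget, n_fires, label = Year)) +
  geom_point() +
  geom_text(vjust = -0.6) +
  labs(x = "Suppression budget ($M)", y = "Wildfire count",
       title = "Budget vs. wildfire count by year")
```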

3. Large fire predictor

Through EDA, we found that class A and B fires occur at very high frequency in California. The dataset shows that class A and B wildfires are observed every day; they are self-limiting and expected to burn out quickly. Fires of class C and above, by contrast, occur far less frequently but can burn over a much larger area, threatening the environment, wildlife, and people living nearby, and are considered dangerous because the burned area grows fast. This section therefore focuses on larger fires: class C or larger fires occurred on 4410 days in California between 1992 and 2015. We build a logit regression model that predicts the chance of such a wildfire from the average temperature, soil moisture, and rainfall.

In the dataset, we labeled days with no large-fire observation as 0 and days with at least one observation as 1. During training, we tried several configurations of input features and found that adding rainfall did not improve on the model with only temperature and soil moisture. The temperature-and-soil-moisture model achieved an AUC of 0.9017. Although this is high enough for prediction, it requires the environmental measurements for the same day being predicted; a more convenient model would use today’s data to predict whether there will be any fires tomorrow. To increase the model’s usability, we therefore adjusted the dataset by merging each day’s fire label with the previous day’s environmental data. Retrained on this adjusted dataset, the logit model with soil moisture and temperature achieved an AUC of 0.9002, close to the same-day model. We then simplified the model by reducing the features required for prediction: the temperature-only model achieved an AUC of 0.8799 and is still useful for next-day prediction with less data. Concretely, in the confusion matrices, the temperature-and-soil-moisture model made 689 type II errors (missed fire days) out of 8036 total days, while the temperature-only model made 780.

library(dplyr)
data_wildfire=read.csv('final_wildfire.csv')
data_large=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp3=data_large %>% count(Year, DISCOVERY_DOY)
rainfall <- read.csv("rainfall_daily.csv")
soilmoisture <- read.csv("soilmoisture_daily.csv")
airtemp <- read.csv("airtemp_daily.csv")
airtemp$month <- strftime(airtemp$time, "%m")
airtemp$year <- strftime(airtemp$time, "%Y")
airtemp$DOY <- strftime(airtemp$time, "%j")
rainfall$year <- strftime(rainfall$time, "%Y")
rainfall$DOY <- strftime(rainfall$time, "%j")

cleaned_rainfall <- subset(rainfall, select = -c(time))
str(airtemp)
## 'data.frame':    23376 obs. of  5 variables:
##  $ time               : chr  "1950-01-01 00:00:00+00:00" "1950-01-02 00:00:00+00:00" "1950-01-03 00:00:00+00:00" "1950-01-04 00:00:00+00:00" ...
##  $ tair_day_livneh_vic: num  4.14 1.77 -3.09 -3.59 -2.75 ...
##  $ month              : chr  "01" "01" "01" "01" ...
##  $ year               : chr  "1950" "1950" "1950" "1950" ...
##  $ DOY                : chr  "001" "002" "003" "004" ...
cleaned_airtemp <- subset(airtemp, select = -c(time))
soilmoisture$year <- strftime(soilmoisture$time, "%Y")
soilmoisture$DOY <- strftime(soilmoisture$time, "%j")

cleaned_soilmoisture <- subset(soilmoisture, select = -c(time))
joined1 <- merge(cleaned_airtemp, cleaned_soilmoisture, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2 <- merge(joined1, cleaned_rainfall, by.x=c("year",'DOY'), by.y=c("year",'DOY'))
joined2$year=as.integer(joined2$year)
joined2$DOY=as.integer(joined2$DOY)
str(joined2)
## 'data.frame':    23376 obs. of  6 variables:
##  $ year                     : int  1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
##  $ DOY                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tair_day_livneh_vic      : num  4.14 1.77 -3.09 -3.59 -2.75 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  18.3 18.5 18.3 18.3 18.1 ...
##  $ rainfall_day_livneh_vic  : num  0.67193 0.79325 0.07883 0.08991 0.00174 ...
joined2=subset(joined2, year>= 1992 & year<= 2015)
joined5=joined2
joined5$DOY=joined5$DOY + 1


data_large=merge(joined2,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large$n[is.na(data_large$n)]=0
data_large$fire=1
data_large$fire[data_large$n==0]=0
str(data_large)
## 'data.frame':    8036 obs. of  8 variables:
##  $ year                     : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ DOY                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tair_day_livneh_vic      : num  3.78 4.17 4.19 4.87 4.94 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  20.2 20.5 21.4 23.6 25.1 ...
##  $ rainfall_day_livneh_vic  : num  0.0616 1.1337 3.0754 7.0797 10.8035 ...
##  $ n                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fire                     : num  0 0 0 0 0 0 0 0 0 0 ...
data_large2=merge(joined5,temp3, by.x=c('year','DOY'),by.y=c('Year','DISCOVERY_DOY'),all.x=T)
data_large2$n[is.na(data_large2$n)]=0
data_large2$fire=1
data_large2$fire[data_large2$n==0]=0

for (i in 1992:2015){  # range(1992, 2015) returns only c(1992, 2015); use a sequence
  tdate=max(subset(data_large2,year==i)$DOY)
  idx=data_large2$year == i & data_large2$DOY==tdate
  data_large2$year[idx]=i+1  # roll the lagged last day of each year into Jan 1 of the next
  data_large2$DOY[idx]=1
}
str(data_large2)
## 'data.frame':    8036 obs. of  8 variables:
##  $ year                     : num  1992 1992 1992 1992 1992 ...
##  $ DOY                      : num  2 3 4 5 6 7 8 9 10 11 ...
##  $ tair_day_livneh_vic      : num  3.78 4.17 4.19 4.87 4.94 ...
##  $ month                    : chr  "01" "01" "01" "01" ...
##  $ soilmoist1_day_livneh_vic: num  20.2 20.5 21.4 23.6 25.1 ...
##  $ rainfall_day_livneh_vic  : num  0.0616 1.1337 3.0754 7.0797 10.8035 ...
##  $ n                        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fire                     : num  0 0 0 0 0 0 0 0 0 0 ...

3.1 Large fire prediction model by logit regression

library(ezids)  # provides xkabledply()
model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=data_large,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :",format(formula(model_large)) ))
Logit Regression : fire ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.0691 0.3117 9.8458 0
tair_day_livneh_vic 0.1461 0.0080 18.1739 0
soilmoist1_day_livneh_vic -0.3157 0.0145 -21.7253 0
library('pROC')
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
prob=predict(model_large, type = "response" )
data_large$prob=prob
h = roc(fire~prob, data=data_large)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.9017
plot(h)

library("regclass")
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
## 
##     predictors
## The following object is masked from 'package:car':
## 
##     logit
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
## 
##     qq
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor
Predicted 0 Predicted 1 Total
Actual 0 3265 659 3924
Actual 1 692 3420 4112
Total 3957 4079 8036

To increase the model’s usability, we next fit a model that predicts the probability of a large fire on the following day:

model_large=glm(fire~tair_day_livneh_vic+soilmoist1_day_livneh_vic+rainfall_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Predicting Probability :", format(formula(model_large))))
Predicting Probability : fire ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic + rainfall_day_livneh_vic
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.4707 0.3245 7.6148 0
tair_day_livneh_vic 0.1508 0.0081 18.6813 0
soilmoist1_day_livneh_vic -0.2743 0.0158 -17.3927 0
rainfall_day_livneh_vic -0.1312 0.0256 -5.1318 0
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.9002
plot(h)

library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor
Predicted 0 Predicted 1 Total
Actual 0 3244 680 3924
Actual 1 689 3423 4112
Total 3933 4103 8036
model_large=glm(fire~tair_day_livneh_vic,data=data_large2,family=binomial())
#summary(model_large)
xkabledply(model_large, title = paste("Logit Regression :", format(formula(model_large))))
Logit Regression : fire ~ tair_day_livneh_vic
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.7380 0.0821 -45.5091 0
tair_day_livneh_vic 0.2833 0.0058 48.6031 0
library('pROC')
prob=predict(model_large, type = "response" )
data_large2$prob=prob
h = roc(fire~prob, data=data_large2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(h) 
## Area under the curve: 0.8799
plot(h)

library("regclass")
library("ezids")
xkabledply( confusion_matrix(model_large), title = "Confusion matrix for large fire predictor" )
Confusion matrix for large fire predictor
Predicted 0 Predicted 1 Total
Actual 0 3222 702 3924
Actual 1 780 3332 4112
Total 4002 4034 8036

These results suggest that, with today’s temperature, rainfall, and soil moisture data, we obtain an AUC of 0.90 for predicting a large fire on the next day. If we instead use a more convenient model that requires only temperature, the AUC is 0.88.

3.2 Multinomial Regression Models

We then try to determine whether we can predict the specific fire size class of a large wildfire (class C or above).

data_wildfire$STAT_CAUSE_CODE <- as.factor(data_wildfire$STAT_CAUSE_CODE)
data_wildfire$FIRE_SIZE_CLASS <- as.factor(data_wildfire$FIRE_SIZE_CLASS)
temp8= subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp8$FIRE_SIZE_CLASS <- droplevels(temp8$FIRE_SIZE_CLASS)  # drop the now-empty A/B levels
split <- createDataPartition(temp8$FIRE_SIZE_CLASS, p = .70, list = FALSE)
train <- temp8[split,]
test <- temp8[-split,]

With our training data, we achieve an accuracy of 50.9% for predicting the fire size class of a wildfire from that month’s conditions. From the confusion matrix, the model is biased towards predicting that a fire is of class C, which makes sense because, among large fires, the smaller classes are much more frequent than the larger ones.

model1 <- multinom(FIRE_SIZE_CLASS ~ tair_day_livneh_vic + 
                     soilmoist1_day_livneh_vic + 
                     rainfall_day_livneh_vic, data = train)
## # weights:  25 (16 variable)
## initial  value 15344.381057 
## iter  10 value 9811.159474
## iter  20 value 9431.462254
## final  value 9431.230655 
## converged
summary(model1)
## Call:
## multinom(formula = FIRE_SIZE_CLASS ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic + 
##     rainfall_day_livneh_vic, data = train)
## 
## Coefficients:
##   (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## D   -1.993649          0.01593043               0.009748272
## E   -2.368006          0.02640819              -0.025071562
## F   -4.536076          0.06298458               0.055103167
## G   -3.482232          0.08401695              -0.127661125
##   rainfall_day_livneh_vic
## D             0.045647775
## E             0.099442850
## F            -0.006113015
## G             0.010006781
## 
## Std. Errors:
##   (Intercept) tair_day_livneh_vic soilmoist1_day_livneh_vic
## D   0.4347072         0.009027252                0.02342523
## E   0.5731534         0.011878893                0.03124641
## F   0.7292202         0.015333614                0.03895641
## G   1.1418168         0.023236312                0.06674467
##   rainfall_day_livneh_vic
## D              0.04289798
## E              0.05069511
## F              0.07867477
## G              0.15289228
## 
## Residual Deviance: 18862.46 
## AIC: 18894.46
train$predictions <- predict(model1, newdata = train, "class")
cm <- confusionMatrix(train$predictions, train$FIRE_SIZE_CLASS)
cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))
tab <- table(train$FIRE_SIZE_CLASS, train$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 0
ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Size")

test$predictions <- predict(model1, newdata = test, "class")
tab <- table(test$FIRE_SIZE_CLASS, test$predictions)
round((sum(diag(tab))/sum(tab))*100,2)
## [1] 0
cm <- confusionMatrix(test$predictions, test$FIRE_SIZE_CLASS)
cmdf <- as.data.frame(cm$table)
cmdf$Prediction <- factor(cmdf$Prediction, levels=rev(levels(cmdf$Prediction)))

ggplot(cmdf, aes(Reference,Prediction, fill= Freq)) +
        geom_tile() + geom_text(aes(label=Freq)) +
        scale_fill_gradient(low="white", high="#009194") +
        labs(x = "Reference",y = "Prediction", title="Confusion Matrix for Predicted Fire Size")

We found that, because the large-wildfire dataset is imbalanced, the model fails to discriminate between classes: it strongly favors predicting class C. This suggests that we would need more data, or an alternative model, to predict these categories.

With our training data, the model could not separate the classes, since it assigned nearly all labels to the majority class. This reflects the heavy dominance of class C among the large fires in our dataset. We conclude that this model is not usable.
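One way to probe this imbalance, sketched here rather than taken from the original analysis, is to downsample the majority class before refitting; `caret::downSample` (caret is already loaded above) trims every class to the size of the rarest one:

```r
library(caret)
set.seed(1)
# Toy imbalanced data standing in for the large-fire training set
toy <- data.frame(temp  = rnorm(100),
                  class = factor(c(rep("C", 80), rep("D", 15), rep("G", 5))))
# downSample keeps all minority rows and randomly discards majority rows
balanced <- downSample(x = toy["temp"], y = toy$class, yname = "class")
table(balanced$class)  # every class reduced to the minority-class count
```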

4. Wildfire cases in the long term

We also wanted to evaluate the long-term effect of natural factors on wildfires. We tried to predict the number of wildfire cases, the average fire size, and the total burned area from those factors.

4.1 Predictions of wildfire cases, fire size, and burned area

temp=aggregate(tair_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp1=aggregate(soilmoist1_day_livneh_vic~Year+month,data=data_wildfire,mean)

temp2=aggregate(rainfall_day_livneh_vic~Year+month,data=data_wildfire,mean)

joined3=merge(temp,temp1,by.x=c('Year','month'),by.y=c('Year','month'))
joined3=merge(joined3,temp2,by.x=c('Year','month'),by.y=c('Year','month'))
temp1=data_wildfire %>% count(Year, month)

temp2=aggregate(FIRE_SIZE~Year+month,data=data_wildfire,mean)
joined4 = merge(temp2, temp1, by.x=c('Year', 'month'), by.y=c('Year', 'month'))
joined4 = merge(joined3, joined4, by.x=c('Year','month'), by.y=c('Year', 'month'))
library(corrplot)
## corrplot 0.92 loaded
joined4$totalarea=joined4$n*joined4$FIRE_SIZE
cor_fire=cor(joined4[c(3,4,5,6,7,8)])
corrplot(cor_fire,method='number',type = 'lower', diag = TRUE)

From the corrplot, we can see that the number of cases has a strong correlation with the environmental variables, and that fire size and total area burned are moderately correlated with them. We first fit linear models for these response variables, but the results suggested poor fit; after log-transforming each response, the models improved.

#Total Fire Cases
#Total Fire Cases
model_case=lm(log(n)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Linear Regression :", format(formula(model_case))))
Linear Regression : log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.8261 0.2806 31.4571 0
tair_day_livneh_vic 0.0526 0.0068 7.7415 0
soilmoist1_day_livneh_vic -0.2322 0.0126 -18.4468 0
VIF(model_case)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_case)

#Average Fire Size
model_avgsize=lm(log(FIRE_SIZE)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Linear Regression :", format(formula(model_avgsize))))
Linear Regression : log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4512 0.8955 7.2041 0.0000
tair_day_livneh_vic 0.0409 0.0217 1.8858 0.0604
soilmoist1_day_livneh_vic -0.2994 0.0402 -7.4540 0.0000
VIF(model_avgsize)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_avgsize)

#Total Fire Area
model_totalarea=lm(log(totalarea)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Linear Regression :", format(formula(model_totalarea))))
Linear Regression : log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.2773 0.9527 16.0351 0e+00
tair_day_livneh_vic 0.0936 0.0231 4.0523 1e-04
soilmoist1_day_livneh_vic -0.5316 0.0427 -12.4385 0e+00
VIF(model_totalarea)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.653691                  3.653691
plot(model_totalarea)

The r-squared value is 0.9 for the cases model, 0.5377 for the average fire size model, and 0.7825 for the total burned area model. From the diagnostic plots, the cases and burned-area models satisfy the linear model assumptions well.

temp=aggregate(tair_day_livneh_vic~Year+month,data=data_wildfire,mean)
temp1=aggregate(soilmoist1_day_livneh_vic~Year+month,data=data_wildfire,mean)

temp2=aggregate(rainfall_day_livneh_vic~Year+month,data=data_wildfire,mean)

joined3=merge(temp,temp1,by.x=c('Year','month'),by.y=c('Year','month'))
joined3=merge(joined3,temp2,by.x=c('Year','month'),by.y=c('Year','month'))
temp1=subset(data_wildfire, FIRE_SIZE_CLASS != 'A' & FIRE_SIZE_CLASS != 'B')
temp2=aggregate(FIRE_SIZE~Year+month,data=temp1,mean)
temp1=temp1 %>% count(Year, month)


joined4 = merge(temp2, temp1, by.x=c('Year', 'month'), by.y=c('Year', 'month'))
joined4 = merge(joined3, joined4, by.x=c('Year','month'), by.y=c('Year', 'month'))

4.2 Long-term large wildfire prediction and analysis

We build models to predict the number of large wildfires, their average fire size, and the total burned area.

library(corrplot)
joined4$totalarea=joined4$n*joined4$FIRE_SIZE
cor_fire=cor(joined4[c(3,4,5,6,7,8)])
corrplot(cor_fire,method='number',type = 'lower', diag = TRUE)

model_case=lm(log(n)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_case)
xkabledply(model_case, title = paste("Case Model:",format(formula(model_case)) ))
Case Model: log(n) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1644 0.4773 12.9163 0
tair_day_livneh_vic 0.0904 0.0112 8.0897 0
soilmoist1_day_livneh_vic -0.2847 0.0219 -12.9789 0
plot(model_case)

VIF(model_case)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188
model_avgsize=lm(log(FIRE_SIZE)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_avgsize)
xkabledply(model_avgsize, title = paste("Avg. Size Model:",format(formula(model_avgsize)) ))
Avg. Size Model: log(FIRE_SIZE) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.4492 0.9322 7.9908 0.0000
tair_day_livneh_vic 0.0407 0.0218 1.8642 0.0635
soilmoist1_day_livneh_vic -0.1770 0.0428 -4.1300 0.0000
plot(model_avgsize)

VIF(model_avgsize)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188
model_totalarea=lm(log(totalarea)~tair_day_livneh_vic+soilmoist1_day_livneh_vic,data=joined4)
#summary(model_totalarea)
xkabledply(model_totalarea, title = paste("Burn Area Model:",format(formula(model_totalarea)) ))
Burn Area Model: log(totalarea) ~ tair_day_livneh_vic + soilmoist1_day_livneh_vic
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.6135 1.1019 12.3546 0
tair_day_livneh_vic 0.1310 0.0258 5.0809 0
soilmoist1_day_livneh_vic -0.4616 0.0506 -9.1154 0
plot(model_totalarea)

VIF(model_totalarea)
##       tair_day_livneh_vic soilmoist1_day_livneh_vic 
##                  3.733188                  3.733188

For these models, the r-squared value is 0.8616 for the cases model, 0.3366 for the average fire size model, and 0.7391 for the total burned area model. The diagnostic plots show that the large-fire cases model does not fit well, as the residual variance is not consistent, and the qq-plot for the average large fire size model shows some departure from normality.
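The qq-plot impression can be supplemented with a formal test on the residuals; a small sketch (on simulated data, not the wildfire models) using base R’s `shapiro.test`:

```r
# Sketch: formal normality check of residuals for a fitted lm object.
set.seed(2)
x <- rnorm(120)
y <- 2 + 0.5 * x + rexp(120)   # deliberately skewed errors
fit <- lm(y ~ x)
shapiro.test(residuals(fit))   # a small p-value flags non-normal residuals
```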

4.3 Regression of wildfires by cause (natural or human-caused)

First, linear modeling was used to predict the total number of fires in California, the fire size, and the burn area with respect to the environmental factors as the independent variables. These models were also looked at individually for naturally caused wildfires and those caused by people to see if there is a difference in the wildfire’s impact.

summary_nature=read.csv('summary_nature.csv')
summary_peoplecaused=read.csv('summary_peoplecaused.csv')
#Renaming variables inside each data frame (assigning to loose global vectors
#would let the second block silently overwrite the first)
summary_nature <- rename(summary_nature,
                         Avg_Temp = tair_day_livneh_vic,
                         Avg_SoilMoisture = soilmoist1_day_livneh_vic,
                         Avg_Rainfall = rainfall_day_livneh_vic)
summary_peoplecaused <- rename(summary_peoplecaused,
                               Avg_Temp = tair_day_livneh_vic,
                               Avg_SoilMoisture = soilmoist1_day_livneh_vic,
                               Avg_Rainfall = rainfall_day_livneh_vic)

4.3.1 Model for all fire incidents

model_nnature <- lm(log(n)~ Avg_Temp + Avg_SoilMoisture, data = summary_nature)
#summary(model_nnature)
xkabledply(model_nnature, title = paste("Nature Caused:",format(formula(model_nnature)) ))
Nature Caused: log(n) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1533 0.3724 21.8919 0
Avg_Temp 0.0595 0.0090 6.5931 0
Avg_SoilMoisture -0.2529 0.0167 -15.1381 0
plot(model_nnature)

## Logn ~ nature variables        - PEOPLE CAUSED (P)
model_npeople <- lm(log(n)~ Avg_Temp + Avg_SoilMoisture, data = summary_peoplecaused)
#summary(model_npeople)
xkabledply(model_npeople, title = paste("People Caused:",format(formula(model_npeople)) ))
People Caused: log(n) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2814 0.3131 23.2584 0
Avg_Temp 0.0526 0.0076 6.9257 0
Avg_SoilMoisture -0.2251 0.0140 -16.0264 0
plot(model_npeople)

The first model, with the total number of fires as the dependent variable, had an adjusted R-squared value of 87% for naturally caused fires: the environmental factors can explain 87% of the total variation in the number of fires. The R-squared value was % for the model where the fires were man-made. For both models, the coefficients showed a direct relationship between air temperature and the number of fires and an inverse relationship between soil moisture and the number of fires; rainfall, where included, was also inversely related to the number of fires.

4.3.2 Model for fire size

## FIRE_SIZE ~ nature variables       - NATURE CAUSED (N)
model_sizenature <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_sizenature)
xkabledply(model_sizenature, title = paste("Nature Caused:",format(formula(model_sizenature)) ))
Nature Caused: log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5705 1.0788 7.0177 0.0000
Avg_Temp 0.0291 0.0261 1.1144 0.2661
Avg_SoilMoisture -0.3783 0.0484 -7.8172 0.0000
plot(model_sizenature)

model_sizepeople <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_sizepeople)
xkabledply(model_sizepeople, title = paste("People Caused:",format(formula(model_sizepeople) )))
People Caused: log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8934 1.1114 4.4029 0.0000
Avg_Temp 0.0756 0.0269 2.8048 0.0054
Avg_SoilMoisture -0.2687 0.0499 -5.3907 0.0000
plot(model_sizepeople)

Next, we modeled the environmental factors and their impact on fire size. The environmental factors can explain % of the variation in fire size in the naturally caused fires model, whereas only % of the variation could be explained in the human-caused fires model. The direction of each relationship was the same for fire size as it was for the number of fires.

4.3.3 Model for burning area

model_areanature <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_nature)
#summary(model_areanature)
xkabledply(model_areanature, title = paste("Nature Caused:",format(formula(model_areanature)) ))
Nature Caused: log(FIRE_SIZE * n) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.7238 1.1627 13.5237 0.0000
Avg_Temp 0.0887 0.0282 3.1459 0.0018
Avg_SoilMoisture -0.6312 0.0522 -12.1022 0.0000
plot(model_areanature)

model_areapeople <- lm(log(FIRE_SIZE*n) ~ Avg_Temp + Avg_SoilMoisture , data = summary_peoplecaused)
#summary(model_areapeople)
xkabledply(model_areapeople, title = paste("People Caused:",format(formula(model_areapeople) )))
People Caused: log(FIRE_SIZE * n) ~ Avg_Temp + Avg_SoilMoisture
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.1748 1.2004 10.1423 0
Avg_Temp 0.1281 0.0291 4.4031 0
Avg_SoilMoisture -0.4938 0.0538 -9.1707 0
plot(model_areapeople)

Finally, we looked at burn area and how environmental factors affect it. We found that % of the variation in burn area was explained in the naturally caused fires model, and % was explained in the people-caused fires model. Again, the relationships between the variables stayed consistent with the previous models.

When comparing the impact of fires by cause, the linear models showed that the relationships between the variables remained consistent whether a wildfire started naturally or was man-made, suggesting that the impact of both kinds of wildfires is equally devastating.
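The consistency claim amounts to checking that the fitted slopes agree in sign across the two models. A sketch of that check, on simulated data standing in for the real summaries (the slope values below are illustrative, not the fitted ones):

```r
# Sketch: fit the same log-linear model on two simulated data sets and
# confirm the Avg_SoilMoisture slope is negative in both, as in the report.
set.seed(1)
make_data <- function(n, b_temp, b_soil) {
  d <- data.frame(Avg_Temp         = rnorm(n, 15, 5),
                  Avg_SoilMoisture = rnorm(n, 20, 3))
  d$FIRE_SIZE <- exp(5 + b_temp * d$Avg_Temp + b_soil * d$Avg_SoilMoisture +
                       rnorm(n, sd = 0.5))
  d
}
fit_nature <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture,
                 data = make_data(200, 0.03, -0.38))
fit_people <- lm(log(FIRE_SIZE) ~ Avg_Temp + Avg_SoilMoisture,
                 data = make_data(200, 0.08, -0.27))

# Side-by-side slopes: same signs indicate a consistent relationship
rbind(nature = coef(fit_nature), people = coef(fit_people))
confint(fit_nature)["Avg_SoilMoisture", ]   # 95% CI on one slope
```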

4. Decision Tree model for multi-class classification

4.1 Predicting all fire classes

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ purrr   0.3.4
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x randomForest::combine() masks dplyr::combine()
## x tidyr::fill()           masks VGAM::fill()
## x dplyr::filter()         masks stats::filter()
## x dplyr::lag()            masks stats::lag()
## x purrr::lift()           masks caret::lift()
## x randomForest::margin()  masks ggplot2::margin()
## x dplyr::recode()         masks car::recode()
## x purrr::some()           masks car::some()
## x readr::spec()           masks yardstick::spec()
## CART 
## 
## 151640 samples
##      4 predictor
##      7 classes: 'A', 'B', 'C', 'D', 'E', 'F', 'G' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 121312, 121313, 121313, 121312, 121310 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa     
##    1        0.5242878  0.07930881
##    3        0.5242878  0.07930881
##    5        0.5242878  0.07930881
##    7        0.5242878  0.07930881
##    9        0.5242878  0.07930881
##   10        0.5242878  0.07930881
##   11        0.5242878  0.07930881
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 1.
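The CART summary above can be reproduced with caret's `train()` using the `rpart2` method, which tunes tree depth directly. A sketch under assumed inputs (the built-in `iris` data stands in for the wildfire features, since the prepared training frame is not shown here):

```r
library(caret)   # train() wrapper; method "rpart2" tunes maxdepth

# Sketch of the CART fit summarised above, with iris standing in for the
# wildfire data (the real outcome would be the fire size class column).
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)            # 5-fold CV
grid <- expand.grid(maxdepth = c(1, 3, 5, 7, 9, 10, 11))   # depths tried above
fit  <- train(Species ~ ., data = iris,
              method = "rpart2", trControl = ctrl, tuneGrid = grid)
fit$bestTune   # depth selected by the largest cross-validated accuracy
```

When every depth yields identical accuracy, as in the output above, caret keeps the smallest value in the grid, which is why the final model reports `maxdepth = 1`.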

4.2 Predicting fire classes A and B using a Decision Tree

## Warning: extra=106 but the response has 7 levels (only the 2nd level is
## displayed)

##    predict_unseen
##         A     B     C     D     E     F     G
##   A 20041  1827     0     0     0     0     0
##   B 12300  1594     0     0     0     0     0
##   C  1244   215     0     0     0     0     0
##   D   300    43     0     0     0     0     0
##   E   135    23     0     0     0     0     0
##   F    96     9     0     0     0     0     0
##   G    75     8     0     0     0     0     0
## [1] "Accuracy for test 0.570693748351358"
## [1] 0.5732524
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     A     B     C     D     E     F     G
##          A 20041 12300  1244   300   135    96    75
##          B  1827  1594   215    43    23     9     8
##          C     0     0     0     0     0     0     0
##          D     0     0     0     0     0     0     0
##          E     0     0     0     0     0     0     0
##          F     0     0     0     0     0     0     0
##          G     0     0     0     0     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5707          
##                  95% CI : (0.5657, 0.5757)
##     No Information Rate : 0.5768          
##     P-Value [Acc > NIR] : 0.9924          
##                                           
##                   Kappa : 0.0326          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity            0.9165  0.11473  0.00000 0.000000 0.000000  0.00000
## Specificity            0.1179  0.91152  1.00000 1.000000 1.000000  1.00000
## Pos Pred Value         0.5861  0.42861      NaN      NaN      NaN      NaN
## Neg Pred Value         0.5087  0.64026  0.96151 0.990952 0.995832  0.99723
## Prevalence             0.5768  0.36650  0.03849 0.009048 0.004168  0.00277
## Detection Rate         0.5286  0.04205  0.00000 0.000000 0.000000  0.00000
## Detection Prevalence   0.9019  0.09810  0.00000 0.000000 0.000000  0.00000
## Balanced Accuracy      0.5172  0.51312  0.50000 0.500000 0.500000  0.50000
##                      Class: G
## Sensitivity          0.000000
## Specificity          1.000000
## Pos Pred Value            NaN
## Neg Pred Value       0.997811
## Prevalence           0.002189
## Detection Rate       0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy    0.500000
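The overall accuracy reported above can be recovered by hand from the confusion matrix: only predictions of A and B are ever made, so the correct predictions are the A-as-A and B-as-B counts.

```r
# Recovering the reported accuracy from the raw confusion-matrix counts
# (rows = predicted class, columns = reference classes A..G).
cm <- rbind(A = c(20041, 12300, 1244, 300, 135, 96, 75),
            B = c( 1827,  1594,  215,  43,  23,  9,  8))
correct <- cm["A", 1] + cm["B", 2]   # on-diagonal cells for A and B
acc <- correct / sum(cm)
round(acc, 4)                        # 0.5707, matching the report
```

This also makes the weakness visible: the 0.5707 accuracy barely differs from the no-information rate of 0.5768 (always predicting class A), which is why Kappa is near zero.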

4.3 Predicting fire classes A and B after tuning the model fit

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     A     B     C     D     E     F     G
##          A 20272 12434  1254   304   137    96    75
##          B  1596  1460   205    39    21     9     8
##          C     0     0     0     0     0     0     0
##          D     0     0     0     0     0     0     0
##          E     0     0     0     0     0     0     0
##          F     0     0     0     0     0     0     0
##          G     0     0     0     0     0     0     0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5733          
##                  95% CI : (0.5683, 0.5782)
##     No Information Rate : 0.5768          
##     P-Value [Acc > NIR] : 0.922           
##                                           
##                   Kappa : 0.0338          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity            0.9270  0.10508  0.00000 0.000000 0.000000  0.00000
## Specificity            0.1086  0.92180  1.00000 1.000000 1.000000  1.00000
## Pos Pred Value         0.5864  0.43739      NaN      NaN      NaN      NaN
## Neg Pred Value         0.5219  0.64034  0.96151 0.990952 0.995832  0.99723
## Prevalence             0.5768  0.36650  0.03849 0.009048 0.004168  0.00277
## Detection Rate         0.5347  0.03851  0.00000 0.000000 0.000000  0.00000
## Detection Prevalence   0.9119  0.08805  0.00000 0.000000 0.000000  0.00000
## Balanced Accuracy      0.5178  0.51344  0.50000 0.500000 0.500000  0.50000
##                      Class: G
## Sensitivity          0.000000
## Specificity          1.000000
## Pos Pred Value            NaN
## Neg Pred Value       0.997811
## Prevalence           0.002189
## Detection Rate       0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy    0.500000

5. Conclusion:

Through basic exploratory data analysis and modeling, this research found a strong relationship between environmental factors and wildfires, including both their burn area and fire size. The linear modeling shows that regardless of the cause of a fire, its impact remains equally devastating.

6. Limitations and Future Research:

The raw data on the environmental factors were recorded at a daily frequency. For this research, the data were aggregated to monthly and yearly frequencies. To make more accurate predictions about wildfires, it would have been beneficial to use the exact temperature, soil moisture, and rainfall readings at the time of each wildfire instead of monthly or yearly averages. In addition, our wildfire data only went up to 2015; it would have been useful to see how frequent and intense fires of the past seven years have been, especially with the acceleration of rising global temperatures.

Through our EDA and modeling, we saw that we were dealing with highly imbalanced data: most of the fires recorded by the Fire Program Analysis (FPA) system fell into only a few categories. In particular, 48% of fires came from just 3 of the 13 causes found in the dataset (debris burning, lightning, and miscellaneous). In addition, just over 90% of wildfires fell into classes A and B, meaning the vast majority of recorded fires burned between 0 and 9.9 acres. We would either need to collect more data on larger-class fires or rebalance our data so that our models can more accurately predict wildfire causes and sizes.
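One standard way to rebalance the training data is caret's `downSample()`, which randomly drops majority-class rows until every class is equally represented. A sketch with simulated labels standing in for the fire size classes:

```r
library(caret)

# Sketch: down-sampling majority classes before refitting the tree.
# The simulated labels mimic the A/B-dominated class distribution.
set.seed(7)
labels <- factor(sample(c("A", "B", "C"), 1000, replace = TRUE,
                        prob = c(0.6, 0.3, 0.1)))
feats  <- data.frame(x = rnorm(1000))

balanced <- downSample(x = feats, y = labels, yname = "FIRE_SIZE_CLASS")
table(balanced$FIRE_SIZE_CLASS)   # every class now has the same count
```

Down-sampling discards data, so with very rare classes such as F and G, up-sampling or class weights may be preferable; the trade-off is worth exploring on the full dataset.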

In the future, it would also be interesting to take a deeper look at Class C fires and study both the factors that cause larger fires and how those factors affect the burn time of these larger fires. Lastly, we would be curious to see how the analytical and prediction techniques used to study wildfires in California would work for other states, such as Colorado, and even for the wildfires in Australia.

California’s fire suppression data was also not granular enough, in that it did not detail where exactly the money was going. For the years where we could compare fire frequency to the budget, the budget did not appear to have any direct effect on the wildfires in California. As a result, we would recommend that the State of California look into how the money is being spent. More specific data on how the budget is spent would also help optimize allocation across different suppression efforts.

7. References: